Introduction

Delayed trains can be a major inconvenience for passengers and threaten the reliability of public transport. This becomes a more serious issue around New York City, which has the highest usage of public transport and highest population density in the United States. According to NJ Transit, 10% of the trains were delayed in 2023, leaving great room for improvement (NJ Transit, 2023).

At the present, NJ Transit Rail Riders must rely on published timetables and real-time delay alerts from NJ Transit when planning their trips. These alerts do not allow for advance planning and are not specific to individual stops, furthermore, current trip planning apps lack delay information. In this “business as usual” system, passengers may spend unnecessary time waiting on platforms in inclement weather or miss transfers at their arrival station when their train is delayed.

We are thus proposing a new app called TrackTrend, designed for regular commuters and open to all. It will provide delay forecasts to users one to two hours in advance of the scheduled arrival, allowing for better trip planning, less waiting time at stations, and less missed transfers. By improving the predictability of riding NJ Transit trains, it can also play apart in drawing more commuters back to public transportation.

As is illustrated in Figure 1, TrackTrend offers a user-friendly interface that helps users to navigate and check the delay information. When users enter the app, they can easily access the train delay information from the nearest station. From there, they can search for specific train lines or tap the train delay prediction for more detail. The details include the historic punctuality of each train line and the predicted delay for each stop, enabling the users to plan ahead for their journeys.

UI Figure 1: TrackTrend User Interface

Our analysis can be replicated on other rail systems that suffer from frequent delays and whose riders would benefit from delay forecasting. The origin-destination dataset in this analysis is in a format widely used in other public transport operations outside NJ. Our model results show relatively high accuracy and generalizability, making it appropriate and transferable for use in other contexts.

Data Wrangling and Exploratory Analysis

The dataset for this analysis was published on Kaggle.com and contains origin-destination data for all NJ Transit commuter trains and Amtrak Northeast Corridor trains in New Jersey, scraped from the “NJ Transit DepartureVision Real Time Train Status service”. We will look at data from September 2 to October 11, 2019. As our focus is on commuters in the New York City metro area, we have filtered the data set for NJ Transit trains in and out of Penn Station and Hoboken on weekdays only.

Setup

Load relevant libraries, datasets and settings.

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(mapview)
library(FNN)
library(caret)

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

tidycensus::census_api_key("c225553bc2ab27557539c33de3ebe928e378d800", overwrite = TRUE)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette5_2 <- c("#08519c", "#3182bd", "#6baed6", "#bdd7e7", "#eff3ff")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

Data Wrangling

To start data wrangling, we first import and select the origin-destination data for NJ Transit from Sep 2nd to Oct 11th, 2019, and remove Atlantic City Line as well as Princeton Shuttle since they are not operating in and out of NYC or Hoboken. We then introduce weather data from EWR airport to represent the weather conditions for all stations, including temperature, wind speed, gust, precipitation and visibility. We create a feature of “interval60” to split all the train departures into hourly intervals.

sep <- read.csv("2019_09.csv")
oct <- read.csv("2019_10.csv")

dat <- rbind(sep, oct) %>%
  na.omit()

dat$timestamp <- ymd_hms(dat$scheduled_time)
#dat$timestamp_ac <- ymd_hms(dat$actual_time)

dat <- dat[dat$timestamp >= ymd_hms("2019-09-02T00:00:00Z") &
                      dat$timestamp < ymd_hms("2019-10-14T00:00:00Z"), ]

stations_nj <- read.csv("Rail_Stations_of_NJ_Transit.csv")
dat2 <- dat %>%
  mutate(interval60 = floor_date(ymd_hms(timestamp), unit = "hour"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  filter(line != "Atl. City Line") %>%
   filter(line != "Princeton Shuttle") %>%
  filter(from != to) %>%
  na.omit()
geo_ny <- 
  get_acs(geography = "tract", 
          variables = "B01003_001",
          year = 2021, 
          state = "NY", 
          geometry = TRUE, 
          output = "wide")

geo_nj <- 
  get_acs(geography = "tract", 
          variables = "B01003_001",
          year = 2021, 
          state = "NJ", 
          geometry = TRUE, 
          output = "wide")

geo_pa <- 
  get_acs(geography = "tract", 
          variables = "B01003_001",
          year = 2021, 
          state = "PA", 
          geometry = TRUE, 
          output = "wide")
dat2 <- 
  left_join(dat2, stations_nj %>%
              as.data.frame() %>%
              select(LATITUDE, LONGITUDE, STATION_ID), by = c("from" = "STATION_ID"))
weather.Panel1 <- 
  riem_measures(station = "EWR", date_start = "2019-09-02", date_end = "2019-10-13")

weather.Panel <- 
  riem_measures(station = "EWR", date_start = "2019-09-02", date_end = "2019-10-13") %>%
  dplyr::select(valid, tmpf, p01i, sknt, relh, vsby, gust)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt),
              Humidity = max(relh),
              Visibility = max(vsby),
              Gust = max(gust)
              ) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

weather.Panel <- 
  weather.Panel %>%
mutate(
    Temperature = ifelse(interval60 == "2019-09-18T04:00:00Z", 58, Temperature),
    Humidity = ifelse(interval60 == "2019-09-18T04:00:00Z", 69.21, Humidity),
  )

Explortary Analysis

To further understand the data, we conduct exploratory analysis to identify the characteristics of the data over time and space and how they are related to each other. We first look into the number of stops made each day of the week, seeing that there is a higher concentration of train schedules during weekdays than weekends, supporting our use case to create more reliable prediction for commuters on weekdays (Figure 2). Typically, the delay for each stop is within 10 minutes. Some occasional significant delay may occur (Figure 3).

dat3 <- dat2 %>%
         group_by(interval60) %>%
         tally()

# argument for choosing weekdays
ggplot(dat2 %>%
         group_by(interval60, dotw)) +
         #tally())+
  geom_bar(aes(x = dotw))+
  labs(title="Number of stops made in an hour by day of the week (Sept 2-Oct 11)",
       caption = "Figure 2",
       x="Day", 
       y="Number of stops")+
  plotTheme

ggplot(dat2 %>%
         group_by(interval60) %>%
        summarise(average_delay = mean(delay_minutes))) +
  geom_line(aes(x = interval60, y = average_delay))+
  labs(title="Average delay minutes at each stop by hour",
        caption = "Figure 3",
       x="Date", 
       y="Average delay minutes")+
  plotTheme

The number of trains over the weekdays remains stable and consistent, with significant AM and PM peaks (Figure 4). Meanwhile, the delay minutes throughout the day vary, with longer delays during the evening and overnight, potentially due to the maintenance work overnight (Figures 5 and 6).

dat2 <- dat2 %>%
  filter(dotw == c("Mon", "Tue", "Wed", "Thu", "Fri"))

ggplot(dat2 %>% mutate(hour = hour(scheduled_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Number of train stops in NJ, by day of the week",
        caption = "Figure 4",
       x="Hour", 
       y="Stop Counts")+
     plotTheme

dat3 <- dat2 %>%
      mutate(hour = hour(interval60)) %>%
      group_by(hour, dotw) %>%
        summarise(average_delay = mean(delay_minutes)) %>%
      na.omit()


dat3$dotw <- factor(dat3$dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))

ggplot(dat3, aes(x = hour, y = average_delay, color = dotw)) +
  geom_line() +
  labs(title="Average delay for each train stop in NJ, by day of the week",
        caption = "Figure 5",
       x="Hour", 
       y="Average delay") +
  plotTheme

ggplot(dat3 %>% 
         group_by(hour) %>%
        summarise(average_delay_weekdays = mean(average_delay)),
     aes(x = hour, y = average_delay_weekdays)) +
  geom_line() +
  labs(title="Average delay for each train stop in NJ on weekdays",
        caption = "Figure 6",
       x="Hour", 
       y="Average delay") +
  plotTheme

Zooming into each station, the delay minutes are mostly within 10 minutes in September and October 2019, with few outliers (Figure 7).

dat2 %>%
         group_by(interval60, from) %>% 
        summarise(average_delay = mean(delay_minutes)) %>%
    ggplot()+
  geom_histogram(aes(average_delay), binwidth = 1) +
  labs(title="Distribution of minutes of delay per hr by station. NJ, Sep-Oct, 2019",
        caption = "Figure 7",
       x="Minutes of delay", 
       y="Number of Stations")+
  plotTheme

# dat2 %>% 
#          mutate(hour = hour(scheduled_time),
#                 weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")) %>%
#         group_by(hour, weekend) %>%
#         summarise(average_delay = mean(delay_minutes)) %>%
#       na.omit() %>%
#   ggplot() +
#      geom_line(data = . %>% ungroup() %>% filter(.$weekend == "Weekend"), 
#              aes(x = hour, y = average_delay, color = "Weekend")) +
#      geom_line(data = . %>% ungroup() %>% filter(.$weekend == "Weekday"), 
#              aes(x = hour, y = average_delay, color = "Weekday")) +
#   labs(title="Average delay for each train stop in NJ, by day of the week",
#         caption = "Figure ",
#        x="", 
#        y="Average delay") +
#   plotTheme  

For each train line, the average delay minutes vary and usually are within 10 minutes, with the exception of the Raritan Valley Line (Figure 8). Breaking down the data to each period of time in a day, we found that there are higher delays for train stations outside NYC during AM Rush and Mid-day (Figure 9 & 10). This could be due to different service levels at further-out stations, or worse track conditions at the end of lines.

#dat3$dotw <- factor(dat3$dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))

dattt <- dat2 %>%
         mutate(hour = hour(interval60)) %>%
         group_by(line) %>%
         summarise(average_delay = mean(delay_minutes))

ggplot(dat2 %>%
         mutate(hour = hour(interval60)) %>%
         group_by(line, hour) %>%
         summarise(average_delay = mean(delay_minutes)), 
       aes(x = hour, y = average_delay, color = line)) +
  geom_line() +
  labs(title="Average delay for each train stop in NJ, by line",
        caption = "Figure 8",
       x="Hour", 
       y="Average delay") +
  plotTheme

dat5 <- dat2 %>% 
            mutate(hour = hour(interval60)) %>%
              group_by(line) %>%
     summarise(line_delay = mean(delay_minutes))

ggplot()+
  geom_sf(data = st_union(geo_nj) %>%
          st_transform(crs=4326))+
  geom_point(data = dat2 %>% 
            mutate(hour = hour(interval60)) %>%
              group_by(LATITUDE, LONGITUDE, line) %>%
            left_join(., dat5, by = "line"),
            aes(x=LONGITUDE, y = LATITUDE, color = line_delay), 
            fill = "transparent", alpha = 0.4, size = 1.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat2$LATITUDE), max(dat2$LATITUDE))+
  xlim(min(dat2$LONGITUDE), max(dat2$LONGITUDE))+
  labs(title="Minutes of delay per hr by line NJ, 2019",
        caption = "Figure 9")+
  mapTheme

# Filtering out NA values
dat2 <- dat2 %>% 
    filter(!is.na(LONGITUDE), !is.na(LATITUDE))

geo_nj_crop <- geo_nj %>% 
 st_crop(geo_nj, xmin = min(dat2$LONGITUDE), xmax = max(dat2$LONGITUDE),
ymin = min(dat2$LATITUDE), ymax = max(dat2$LATITUDE))

geo_ny_crop <- geo_ny %>% 
 st_crop(geo_nj, xmin = min(dat2$LONGITUDE), xmax = max(dat2$LONGITUDE),
ymin = min(dat2$LATITUDE), ymax = max(dat2$LATITUDE))

geo_pa_crop <- geo_pa %>% 
 st_crop(geo_nj, xmin = min(dat2$LONGITUDE), xmax = max(dat2$LONGITUDE),
ymin = min(dat2$LATITUDE), ymax = max(dat2$LATITUDE))

geo_crop <- rbind(geo_nj_crop, geo_ny_crop, geo_pa_crop)


dat2 <- dat2 %>% 
            mutate(hour = hour(scheduled_time),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))

ggplot()+
  geom_sf(data = geo_crop %>%
          st_transform(crs=4326), color = "grey")+
  geom_point(data = dat2  %>%
              group_by(from, LATITUDE, LONGITUDE, time_of_day) %>%
              summarise(average_delay = mean(delay_minutes)),
            aes(x=LONGITUDE, y = LATITUDE, size = average_delay, fill = average_delay), shape = 21, 
             alpha = 0.8)+
  scale_size_continuous(range = c(0.3, 6)) +
  # scale_colour_viridis(direction = -1,
  # discrete = FALSE, option = "D")+
  scale_fill_gradient2(low = 'white',  high = '#0868ac') + 
  coord_sf(xlim = c(min(dat2$LONGITUDE), max(dat2$LONGITUDE)), 
           ylim = c(min(dat2$LATITUDE), max(dat2$LATITUDE)))+
  facet_wrap(vars(time_of_day))+
  labs(title="Minutes of delay per hr by station. NJ, 2019", 
        caption = "Figure 10")+
  mapTheme  +
    theme(plot.title = element_text(size=20)) +
  guides(size = guide_legend(title = "Average delay by station", 
                             override.aes = list(alpha = 1, shape = 21)),
         fill = guide_legend(title = "Average delay by station"))

datw <- left_join(dat2, weather.Panel, by = "interval60")

We create time lag features for 1 hour, 2 hours and 1 day. One-hour lag has the highest correlation with average delay, which indicates that the 1-hour lag is more effective in predicting train delays (Table 1).

datw <- 
  datw %>% 
  arrange(from, interval60) %>% 
  mutate(lagHour = dplyr::lag(delay_minutes,1),
         lag2Hours = dplyr::lag(delay_minutes,2),
         lag1day = dplyr::lag(delay_minutes,24))
kable(as.data.frame(datw) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "delay_minutes"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -delay_minutes) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, delay_minutes),2)),
     caption = "Table 1") %>% kable_styling("striped")
Table 1
Variable correlation
lagHour 0.29
lag2Hours 0.12
lag1day 0.03

Weather conditions typically have a significant impact on railroad operation. However, it is discovered that there is no strong correlation between weather features and average delays (Figure 12). Wind speed and temperature are slightly more related to average delay. Strong winds may delay trains by knocking debris onto tracks or damaging overhead wires.

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Humidity)) + geom_line() + 
    labs(title="Humidity", x="Hour", y="Humidity") + plotTheme,
   ggplot(weather.Panel, aes(interval60,Visibility)) + geom_line() + 
    labs(title="Visibility", x="Hour", y="Visibility") + plotTheme,
    ggplot(weather.Panel, aes(interval60,Gust)) + geom_line() + 
    labs(title="Gust", x="Hour", y="Gust") + plotTheme,
  top="Figure 11: Weather Data - EWR - Sep-Oct, 2019")

correlation.long <-
  datw %>%
  dplyr::select(Temperature, Humidity, Precipitation, Wind_Speed, Visibility, Gust, delay_minutes) %>%
    gather(Variable, Value, -delay_minutes) %>%
  group_by(Variable, Value) %>%
  summarise(average_delay = mean(delay_minutes))

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, average_delay, use = "complete.obs"))

ggplot(correlation.long, aes(Value, average_delay)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, 
            aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Delay minutes as a function of weather",
        caption = "Figure 12") 

Model

We create four regression models to predict train delays. In each, the dependent variable is the delay time at a given station stop at a particular hour of the day. The first model uses only fixed time-effects as predictors: hourly interval, day of the week, temperature, and wind speed. The second uses only fixed space-effects: from station, to station, and train line. The third model uses both the fixed time- and space- effects. Finally, the fourth model uses time, space, and time lag features, looking at station delays 1, 2, and 24 hours in the past.

delay.Train <- filter(datw, week <= 39)
delay.Test <- filter(datw, week > 39)
reg1 <- 
  lm(delay_minutes ~  hour(interval60) + dotw + Temperature + Wind_Speed,  data=delay.Train)

summary(reg1)

reg2 <- 
  lm(delay_minutes ~  from + to + line,  data=delay.Train)
summary(reg2)

reg2.5 <- 
  lm(delay_minutes ~  line,  data=delay.Train)
summary(reg2.5)

reg3 <- 
  lm(delay_minutes ~  hour(interval60) + dotw + Temperature + Wind_Speed + line + from + to , 
     data=delay.Train)
summary(reg3)

reg4 <- 
  lm(delay_minutes ~  lagHour + lag2Hours + lag1day + hour(interval60) + dotw + Temperature + Wind_Speed + line + from + to , 
     data=delay.Train)
summary(reg4)
delay.Test.weekNest <- 
  delay.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

The data is then split into two parts. The models are trained on data from September 2nd to September 30th and tested on data from October 1st to October 11th. All four models produce a similar mean absolute error (MAE) at around 3 minutes, although the first model, using just space effects, is slightly less accurate (See Figure 13). We then plot the predicted delays against the observed delays over time (See Figure 14), and find that each model captures the temporal trend of delays fairly well, although they all fail to predict the extreme delay event at the end.

week_predictions <- 
 delay.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, delay_minutes),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week",
          caption = "Figure 13") +
  plotTheme

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from = map(data, pull, from)) %>%
    dplyr::select(interval60, from, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -from) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed train delay time series", subtitle = "NJ; A test set of 2 weeks",
            caption = "Figure 14",
           x = "Hour", y= "Delay minutes") +
      plotTheme

Next, a 100-fold cross-validation is performed on a model incorporating the station name, hourly interval, day of the week, wind speed, 1-hour time lag, and 2-hour time lag. This is a streamlined version of the fourth model described above, preserving space, time, and time-lag features (specifically 1- and 2-hour time lags, in line with our app use case) while reducing computation time for cross validation. We are comfortable with this approach, as our MAE results (Figure 13) showed little variation based on dependent variables used. The cross-validation yields a fairly normal distribution of error centered around 3.1 minutes, as seen in Figure 15, corroborating our earlier MAE results.

## define the variables we want
reg.vars <- c("from", "to", "hour(interval60)", "dotw", "Wind_Speed","line" ,
              "lagHour", "lag2Hours", "lag1day")

datw.cv <- datw %>%
  mutate(cvID = sample(round(nrow(datw) / 100),
                       size=nrow(datw), replace = TRUE))


fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.CV <-
  train(delay_minutes ~  from + hour(interval60) + dotw +
          Wind_Speed  + lagHour + lag2Hours,
        data = datw,
        method = "lm", trControl = fitControl, na.action = na.pass)

ggplot(reg.CV$resample, aes(x=MAE)) +
  geom_histogram(fill = "#08519c", color = "white") +
  labs(
    title = "Distribution of MAE Across 100-Fold Cross Validation",
    subtitle = "",
    caption = "Figure 15")

To assess the generalizability of our model, we examine MAE for each individual line and station. The model generalizes well across lines, with MAEs ranging from just over 2 to just over 4 minutes (See Table 2). While the Morristown line has a relatively high MAE, it also has the highest average observed delays. The model also generalizes fairly well across stations, although stations further from New York City tend to have higher prediction errors, especially during the AM rush (See Figure 16).

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from = map(data, pull, from), 
           start_lat = map(data, pull, LATITUDE), 
           start_lon = map(data, pull, LONGITUDE),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, from, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(from, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = geo_crop, color = "lightgrey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 1, alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  facet_wrap(~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set",
        caption = "Figure 16")+
  mapTheme

kable(as.data.frame(week_predictions) %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>%
          unnest() %>%
  group_by(line) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)),  caption = "Table 2") %>% kable_styling("striped")
Table 2
line MAE
Bergen Co. Line 2.739433
Gladstone Branch 3.576761
Main Line 2.113863
Montclair-Boonton 3.038418
Morristown Line 4.346157
No Jersey Coast 2.966303
Northeast Corrdr 2.746860
Pascack Valley 2.112602
Raritan Valley 3.703619

Reflection and Conclusion

In this analysis we were able to predict train delays with a mean absolute error of about 3 minutes based on station locations, train lines, time of day, day of the week, weather conditions, and time lags of 1, 2, and 24 hours. The use case for track trend is to provide users forecasts 1-2 hours ahead of the scheduled train time, which our model does. As the time-lags did not noticeably improve the model (see Figures 13 and 14), the time of day and day of week must hold equal or greater significance, meaning our model can produce reliable delay predictions well ahead of the scheduled time, greatly improving the customer experience over “business as usual”.

Where can We Improve?

To further improve our model, we aim to use more precise weather data for train stations and use more training data that cover longer periods of time to obtain more accurate predictions. The current weather data we use is from a single reporting point at Newark Airport (EWR) and we applied this weather data to all stations across the whole region, which is relatively less accurate for each individual station. By acquiring more specific weather data in smaller regions, we can apply the weather features to the nearest stations to enhance the precision of our analysis. Furthermore, training the model over a longer time, one year for instance, can benefit our use case and modeling result. The 4 week training data we use currently may be less reliable to predict other months of a year for commuters.